What we’re doing

This markdown documents the journey of cleaning even a small amount of TEROS data for the TEMPEST 2 flooding events. Initial attempts to clean these data ran into too many decision points, hence the need to carefully document and justify how we are deciding to clean them. Ideally, this will help with writing the associated methods and with explaining decisions to co-authors.

First visualization of all TEROS data

Looking at all the TEROS data from the time-period of interest, it’s clear that there’s a lot going on, including non-responsive sensors, high intra-plot variability, and different responses to the flooding events.

5-minute data

There are two immediate issues with the 5-minute data: 1) static sensors with high VWC (> 0.75), and 2) many sensors missing data for some or most of the time-period of interest. This is a bummer, since the 5-minute data temporally matches our DO and redox datasets.

15-minute data

The 15-minute data looks generally better in terms of continuity across the time-frame of interest, but has the same issue with flatlined sensors.

QC Step 1: remove high-VWC flatlined sensors

The first step is the easiest: scrub the high-VWC sensors. Conveniently, they all have values of 0.75 and none of the other sensors do, so we can easily trim them out. While we’re at it, let’s make plots for the other two variables as well, to see if they need initial cleaning:

Volumetric Water Content

Electrical Conductivity

Soil Temperature

QC Step 2: remove high temp sensor in Seawater

We have one rogue sensor reading really high temperatures (> 30), which doesn’t make sense and is a clear outlier. Let’s remove that sensor as well.
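QC steps 1 and 2 can be sketched together as a per-sensor filter. This is a minimal illustration, not the code used for the analysis: the sensor IDs, plot labels, and readings below are made up, and it assumes a sensor is dropped entirely if all of its VWC values sit at 0.75, or if any of its temperature readings exceeds 30.

```python
# Hypothetical records: (sensor_id, plot, vwc, temperature).
# IDs and values are invented purely for illustration.
readings = [
    ("S1", "Control", 0.42, 18.5),
    ("S1", "Control", 0.44, 18.7),
    ("S2", "Seawater", 0.75, 19.0),
    ("S2", "Seawater", 0.75, 19.1),
    ("S3", "Seawater", 0.39, 31.2),
    ("S3", "Seawater", 0.40, 32.0),
    ("S4", "Freshwater", 0.51, 17.9),
]

FLATLINE_VWC = 0.75  # value reported by the static sensors
MAX_TEMP = 30.0      # assumed cutoff for implausible temperatures


def sensors_to_drop(rows):
    """Return the set of sensor IDs that fail QC step 1 or QC step 2."""
    by_sensor = {}
    for sensor_id, _plot, vwc, temp in rows:
        by_sensor.setdefault(sensor_id, []).append((vwc, temp))

    drop = set()
    for sensor_id, values in by_sensor.items():
        # QC step 1: every VWC reading sits at the flatline value.
        if all(vwc == FLATLINE_VWC for vwc, _ in values):
            drop.add(sensor_id)
        # QC step 2 (assumption): any reading above the temperature cutoff.
        elif any(temp > MAX_TEMP for _, temp in values):
            drop.add(sensor_id)
    return drop


dropped = sensors_to_drop(readings)
cleaned = [row for row in readings if row[0] not in dropped]
```

Filtering by sensor rather than by individual reading keeps the remaining time series intact, which matters later when comparing sample sizes across days.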

Decision point: merging 5-minute and 15-minute datasets

There are a few potential routes here:

  1. Use only 15-minute data: these datasets appear to be complete and generally clean, but we lose all 5-minute resolution, which means we effectively drop to 15-minute resolution for all datasets when comparing to any TEROS dataset.
  2. Use only 5-minute data: this is a bad option, because we’ve got limited coverage.
  3. Merge datasets: include all 15-minute data, filling in with 5-minute data.
  4. Merge datasets: include all 5-minute data, filling in with 15-minute data.
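To make the difference between options 3 and 4 concrete, here is a small sketch of a merge where one series takes precedence at shared timestamps. The function name `merge_series`, the minutes-since-start timestamps, and the values are all hypothetical; the only thing it is meant to show is which record wins when both resolutions report at the same time.

```python
def merge_series(primary, secondary):
    """Keep every record in `primary`; use `secondary` only at
    timestamps that `primary` does not cover."""
    merged = dict(secondary)
    merged.update(primary)  # primary overwrites secondary on shared timestamps
    return dict(sorted(merged.items()))


# One hypothetical hour of data, keyed by minutes since the start of the hour.
fifteen_min = {m: 0.40 for m in range(0, 60, 15)}  # 4 records
five_min = {m: 0.41 for m in range(0, 60, 5)}      # 12 records

# Option 3: all 15-minute data, filled in with 5-minute data.
option3 = merge_series(primary=fifteen_min, secondary=five_min)

# Option 4: all 5-minute data, filled in with 15-minute data.
option4 = merge_series(primary=five_min, secondary=fifteen_min)

# An hour where the 5-minute record is missing entirely.
option3_gap = merge_series(primary=fifteen_min, secondary={})
```

Both options give 12 records for an hour where 5-minute data exist and differ only in which value is kept at minutes 0, 15, 30, and 45. In an hour with no 5-minute data, the merged series falls back to 4 records.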

If the 15-minute and 5-minute data are comparable, then either 3 or 4 should be used. If they aren’t, we should use 1. So, let’s compare them:
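One way to put a number on "comparable" is to look only at timestamps where both resolutions report a value and summarize the differences. This is a minimal sketch under that assumption; the function name, timestamps, and values are hypothetical, and what size of difference counts as acceptable is a judgment call that is not settled here.

```python
def compare_at_shared_times(five_min, fifteen_min):
    """Summarize 5-minute minus 15-minute differences at shared timestamps."""
    shared = sorted(set(five_min) & set(fifteen_min))
    if not shared:
        return None  # nothing to compare
    diffs = [five_min[t] - fifteen_min[t] for t in shared]
    return {
        "n": len(diffs),
        "mean_diff": sum(diffs) / len(diffs),
        "max_abs_diff": max(abs(d) for d in diffs),
    }


# Hypothetical VWC values keyed by minutes since the start of the record.
five_min = {0: 0.40, 5: 0.41, 10: 0.41, 15: 0.42, 30: 0.43}
fifteen_min = {0: 0.40, 15: 0.42, 30: 0.44, 45: 0.45}

summary = compare_at_shared_times(five_min, fifteen_min)
```

Here the two series overlap at minutes 0, 15, and 30, so `summary["n"]` is 3. A mean difference near zero with a small maximum absolute difference would support merging (options 3 or 4); a consistent offset would argue for option 1.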

Decision

I’m going to go with #3 above. The 15-minute record is complete, so even though 5-minute and 15-minute values at the same timestamp should be the same, we will keep all 15-minute records where possible (and thus the full time-frame) and fill in 5-minute values as available. My main concern is that we will now have unequal sample sizes for different periods of the same length (e.g. on 6/3, with 5-minute data missing, a day will have 4 × 24 = 96 values, while 6/9 will have 12 × 24 = 288 values). This likely doesn’t matter much since we have such high sample sizes across our datasets, but it is something to keep in mind. Let’s revisualize our newly merged dataset:

VWC

EC

Temp